November 10, 2023

Setup and Code Availability

Problem Set-Up

Consider a simple study of the microbiome pre/post antibiotic administration.

  • Research question: Which taxa change in absolute abundance after taking an antibiotic?
  • How do we analyze this data to answer the question?

Sequence Count Data as a Sample

Gap between Data and System

Implications for Analysis

Options for Analysis

  1. Use an existing method for differential abundance.

    • ALDEx2, DESeq2, edgeR, etc.
    • These methods rely on normalizations.

    \[\tilde{\theta_d} = f(Y).\]

  2. Reframe the question (CoDA).

    • Data are compositional.
    • Differential abundance cannot be done rigorously.

    \[\tilde{\theta_d} = \mathrm{mean}_{\mathrm{post}} \, \mathrm{CLR}(Y) - \mathrm{mean}_{\mathrm{pre}} \, \mathrm{CLR}(Y).\]

Problems with these Options

  • Option 1: Disconnect between question and normalization can lead to elevated type-I and type-II errors. Why?

  • Option 2: Can’t answer many research questions.

New Option: Scale Reliant Inference

A middle ground approach:

  • \(Y\) is a measurement of the underlying system \(W\).

  • Desired quantity depends on \(W\) (i.e., \(\theta = f(W)\)). However, \(W\) depends on both the composition and system scale:

\[W_{dn} = W_{dn}^\parallel W_n^\perp\] \[W_n^\perp = \sum_{d=1}^D W_{dn}\]

Scale Simulation Random Variables

Goal: Estimate \(\theta = f(W^\parallel, W^\perp)\).

  1. Draw samples of \(W^{\parallel}\) from a measurement model (can depend on \(Y\)).
  2. Draw samples of \(W^{\perp}\) from a scale model (can depend on \(W^{\parallel}\)).
  3. Estimate samples of \(\theta = f(W^\parallel, W^\perp)\).

Adding Scale to ALDEx2

ALDEx2 Software Suite

ALDEx2 Software Suite

Unacknowledged Bias within ALDEx2

Step 2 places an assumption on system scale.

  • Consider that \(\log W_{dn} = \log W_{dn}^\parallel + \log W_n^\perp\).

  • We can show that the CLR transformation corresponds to the assumption:

\[\log W_n^\perp = - \mathrm{mean}(\log \hat{W}_{\cdot n}^\parallel).\]

Unacknowledged Bias within ALDEx2

Moving Past Normalizations to Scale within ALDEx2

What if we consider error in this assumption?

\[\log W_n^\perp = - \mathrm{mean}(\log \hat{W}_{\cdot n}^\parallel) + \epsilon.\]

Moving Past Normalizations to Scale within ALDEx2

Scale Models in ALDEx2

Normalizations are replaced by a scale model:

\[\log W_n^\perp \sim Q\]

There are no restrictions on \(Q\), although there are some helpful options:

  1. Based on normalizations.
  2. Based on biological knowledge.
  3. Based on outside measurements.

New Additions to ALDEx2

Data Simulation

Simulation functions can be found in scripts > data_simulation.R.

Reading in the Simulated Data

###Analysis file for simulated data
library(ALDEx2)
library(tidyverse)
library(ggplot2)
library(ggpattern)
library(cowplot)

set.seed(12345)
setwd('..')
##Reading in data (see "data_simulation.R" for details.)
rdat <- read.csv(file.path("data/simulation/sim_seq_dat.csv"))
flow_data <- read.csv(file.path("data/simulation/sim_flow_dat.csv"))

Inspecting the Data

## "Y" represents the OTU table
Y <- t(rdat[,-1])
Y[1:4,1:4]
##       [,1] [,2] [,3] [,4]
## Taxa1  657  672  652  676
## Taxa2  669  697  634  652
## Taxa3  663  675  660  675
## Taxa4  666  668  633  670
##Vector denoting whether samples was in
##pre- or post- antibiotic condition.
conds <- as.character(rdat[,1])
conds[c(1:4,51:55)]
## [1] "Pre"  "Pre"  "Pre"  "Pre"  "Post" "Post" "Post" "Post" "Post"

Effects of Unacknowledged Bias

## Fitting and analyzing the original ALDEx2
## model
mod.base <- aldex(Y, conds)
row.names(mod.base %>%
    filter(we.eBH < 0.05))
##  [1] "Taxa1"  "Taxa2"  "Taxa3"  "Taxa4"  "Taxa5"  "Taxa6"  "Taxa7" 
##  [8] "Taxa8"  "Taxa9"  "Taxa10" "Taxa11" "Taxa12" "Taxa13" "Taxa14"
## [15] "Taxa15" "Taxa16" "Taxa17" "Taxa18" "Taxa19" "Taxa20"

Including scale through gamma.

  • Added as argument to the aldex and aldex.clr function.

  • gamma can either be a single numeric or a matrix.

    1. Single numeric: controls the noise on the default scale model.
    2. Matrix: A \(N \times S\) matrix of samples of \(W\).

Option 1: Default Scale Model

The default scale model is based on errors in the CLR normalization.

\[\log \hat{W}_{n}^{\perp(s)} = - \mathrm{mean} \left(\log \hat{W}^{\parallel (s)}_{\cdot n}\right) + \Lambda^\perp x_{n}\] \[\Lambda^\perp \sim \ N(0, \gamma^2).\]

Decoding the Default Scale Model

Decoding the Default Scale Model

Re-creating the CLR with gamma.

mod.clr <- aldex(Y, conds, gamma = 1e-3)
plot(mod.base$effect, mod.clr$effect, xlab = "Original ALDEx2 
  Effect Size", ylab = "CLR Scale Model Effect Size")
abline(a=0,b=1, col = "red", lty = "dashed")

How to Choose \(\gamma\)

We recommend a value of \(\gamma = 0.50\).

  • 95% of the scale samples are within half or double of the amount implied by the normalization.
  • Any value of \(\gamma > 0\) is an improvement over the status quo!!

Option 1: Examples (Some Noise)

## Adding noise via the default scale model
mod.ss <- aldex(Y, conds, gamma = .25)
mod.ss %>% filter(we.eBH < 0.05)
##          rab.all rab.win.Post rab.win.Pre  diff.btw  diff.win
## Taxa4   1.704398     1.527557    2.170773 0.6532608 0.3433438
## Taxa15 -1.484927    -1.824995   -1.093582 0.7627941 0.5892416
## Taxa20 -2.010254    -2.786268   -1.117815 1.7079378 0.6566340
##          effect      overlap        we.ep       we.eBH        wi.ep
## Taxa4  1.813528 0.0278125355 2.990293e-02 3.128161e-02 2.851928e-02
## Taxa15 1.207046 0.0787500118 1.630176e-02 2.984908e-02 1.912061e-02
## Taxa20 2.582495 0.0006266114 5.088655e-19 1.017731e-17 6.106315e-17
##              wi.eBH
## Taxa4  3.135566e-02
## Taxa15 3.774518e-02
## Taxa20 1.156039e-15

3 of 20 taxa (all true positives) are significant!

Option 1: Examples (More Noise)

## Adding more noise via the default scale model
mod.ss.high <- aldex(Y, conds, gamma = 1)
mod.ss.high %>% filter(we.eBH < 0.05)
##  [1] rab.all      rab.win.Post rab.win.Pre  diff.btw     diff.win    
##  [6] effect       overlap      we.ep        we.eBH       wi.ep       
## [11] wi.eBH      
## <0 rows> (or 0-length row.names)

No taxa are significant!

Option 2: Create Your Own Scale Matrix

Alternatively, can pass a matrix of scale samples to gamma so long as:

  1. The dimension is \(N \times S\).
  2. They are samples of \(W^\perp\) not \(\log W^\perp\).

Example 1: Based on Biology

  • Scale is guided by the biological system or the researcher’s prior beliefs.

  • Suppose we expect the antibiotic to decrease the total scale by 10%. This can be written in a scale model:

\[\log \hat{W}_n^{(s)} \sim N(1.0, \gamma^2) \, \, \mathrm{if} \, \, x_n = \mathrm{``pre"}\] \[\log \hat{W}_n^{(s)} \sim N(0.9, \gamma^2) \, \, \mathrm{if} \, \, x_n = \mathrm{``post"}.\]

Example 1: Based on Biology

  • The aldex.makeScaleMatrix function can help with this.

  • Two sample example: aldex.makeScaleMatrix(gamma <- 1, mu <- c(1,.9), conds <- c("pre", "post"), log = FALSE)

  • Note: The absolute scale doesn’t matter. It is the relationship between conditions.

Example 1: Based on Biology

##Creating an informed model using biological reasoning
scales <- c(rep(1, 50), rep(0.9, 50))
scale_samps <- aldex.makeScaleMatrix(.15, scales, conds, log=FALSE)
mod.know <- aldex(Y, conds, gamma = scale_samps)
row.names(mod.know %>% filter(we.eBH < 0.05))
## [1] "Taxa3"  "Taxa4"  "Taxa15" "Taxa20"

Example 2: Outside Measurements

  • There has been a push to collect outside measurements (e.g., qPCR, flow cytometry).

  • These can be used in building a scale model if they are informative on the scale of interest.



\[\hat{W}^{\perp (s)}_{n} \sim N(\mathrm{mean}(z_{1n}, z_{2n}, z_{3n}),\text{var}(z_{1n}, z_{2n}, z_{3n})).\]

Example 2: Flow Cytometry Data

##Inspeciting our flow cytometry data
head(flow_data)
##   sample     flow
## 1      1 30212.92
## 2      1 29536.80
## 3      1 30727.71
## 4      2 30168.53
## 5      2 29740.54
## 6      2 29891.60

Example 2: Flow Cytometry Data

## Now creating an informed model using the flow data
flow_data_collapse <- flow_data %>%
    group_by(sample) %>%
    mutate(mean = mean(flow)) %>%
    mutate(stdev = sd(flow)) %>%
    dplyr::select(-flow) %>%
    ungroup() %>%
    unique()

scale_samps <- matrix(NA, nrow = nrow(flow_data_collapse), ncol = 128)
for (i in 1:nrow(scale_samps)) {
    scale_samps[i, ] <- rnorm(128, flow_data_collapse$mean[i],
        flow_data_collapse$stdev[i])
}

Example 2: Flow Cytometry Data

mod.flow <- aldex(Y, conds, gamma = scale_samps)
row.names(mod.flow %>%
    filter(we.eBH < 0.05))
## [1] "Taxa3"  "Taxa4"  "Taxa15" "Taxa20"

Comparing Results between Scale Models

Sensitivity Analyses in ALDEx2

Example: Sensitivity Analyses

##Now running a sensitivity analysis over the default scale model

##First, specifying different values for the noise in the scale
gamma_to_test <- c(1e-3, .1, .25, .5, .75, 1, 2, 3, 4, 5)

##Run the CLR function
clr <- aldex.clr(Y, conds)

##Run sensitivity analysis function
sen_res <- aldex.senAnalysis(clr, gamma = gamma_to_test)

##Plotting the sensitivity results.
plotGamma(sen_res, thresh = .1)

Example: Sensitivity Analyses

Example: Sensitivity Analyses

A Real-Data Example with SELEX

Data Description

  • SELEX: Selected Growth Experiment.
  • Goal is to determine which of 1,600 gene variants confer a growth phenotype.
  • Study was conducted so that those variants that confer a growth phenotype would increase in the presence of a bacteriostatic toxin.

Data Description

  • Two conditions present: selected and non-selected.
    1. Selected: Bacteriostatic toxin allows for selected growth in some cell lines.
    2. Non-selected: Growth inhibitor leads to no growth in any of the cell lines.
  • Originally published in McMurrough et. al (2014) and discussed in Fernandes et. al (2014).

Reading in the Data

library(ALDEx2)
library(tidyverse)
library(ggrepel)

conds <- c(rep("NS", 7), rep("S", 7))

data(selex)
selex[1:4,1:4]
##         X1_ANS X1_BNS X1_CNS X1_DNS
## A:D:A:D    347    271    396    317
## A:D:A:E    436    361    461    241
## A:E:A:D    476    288    378    215
## A:E:A:E    513    307    424    381

We have access to ground truth data.

##Ground truth
setwd('..')
true.pos <- read.delim(file.path("data","selex","selex-truth.txt"))$X
true.pos
##  [1] "A:E:G:E" "A:D:G:E" "A:E:G:D" "A:D:G:D" "G:E:A:E" "G:D:A:E"
##  [7] "G:E:A:D" "G:D:A:D" "A:E:A:E" "A:D:A:E" "A:E:A:D" "A:D:A:D"
## [13] "G:E:G:E" "G:D:G:E" "G:E:G:D" "G:D:G:D" "S:E:G:E" "S:D:G:E"
## [19] "S:E:G:D" "S:D:G:D" "G:E:S:E" "G:D:S:E" "G:E:S:D" "G:D:S:D"
## [25] "A:E:S:E" "S:E:A:E" "S:E:S:E"

Modeling with ALDEx2 (no scale)

##ALDEx2
aldex.mod <- aldex(selex,conds)
aldex.pos <- aldex.mod %>% filter(we.eBH < 0.05)
table(row.names(aldex.pos) %in% true.pos)
## 
## FALSE  TRUE 
##    52    18

Modeling with ALDEx2 (default scale)

##Adding scale
aldex.scale <- aldex(selex, conds, gamma = .5)
aldex.scale.pos <- aldex.scale %>% filter(we.eBH < 0.05)
table(row.names(aldex.scale.pos) %in% true.pos)
## 
## FALSE  TRUE 
##    38    17

Testing over Gamma (default scale)

##Plotting TP and FP over different levels of gamma
gamma.to.test <- c(1e-3,.25,.5,.75,1,2,3, 4, 5)

TP <- rep(NA, length(gamma.to.test))
FP <- rep(NA, length(gamma.to.test))

for(i in 1:length(gamma.to.test)){
  mod <- aldex(selex,conds,gamma = gamma.to.test[i])
  mod.pos <- row.names(mod %>% filter(we.eBH < 0.05))

  TP[i] <- sum(mod.pos %in% true.pos)
  FP[i] <- sum(!(mod.pos %in% true.pos))
}

Testing over Gamma (default scale)

graph.df <- data.frame("Gamma" = rep(gamma.to.test, 2),
                       "Counts" = c(TP, FP),
                       "Group" = c(rep("TP", length(gamma.to.test)),
                                   rep("FP", length(gamma.to.test))))

ggplot(graph.df, aes(x = Gamma, y = Counts,
                     group = Group, color = Group)) +
  geom_line(aes(linetype = Group), lwd = 1.25) +
  scale_color_manual(values = c("red3", "royalblue3")) +
  scale_linetype_manual(values=c("twodash", "longdash")) +
  theme_bw() +
  ylab("Counts")

Testing over Gamma (default scale)

Built-In Sensitivity Analyses

##Built-in sensitivity analysis
clr <- aldex.clr(selex, conds)

##Run sensitivity analysis function
sen_res <- aldex.senAnalysis(clr, gamma = gamma.to.test)

##Plotting the sensitivity results.
plotGamma(sen_res, thresh = .1)

Built-In Sensitivity Analyses

References

  • Nixon, et. al. (2023) “Scale Reliant Inference.” ArXiv.

  • Gloor, Nixon, and Silverman. (2023) “Scale is Not What You Think; Explicit Scale Simulation in ALDEx2.” BioRXiv.

  • Nixon, Gloor, and Silverman. (2023) “Beyond Normalizations: Incorporating Scale Uncertainty in ALDEx2.” Forthcoming.

  • McMurrough et. al. (2014).”Control of catalytic efficiency by a coevolving network of catalytic and noncatalytic residues.” PNAS.

  • Fernandes et. al. (2014). “Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis.” Microbiome.